Abstract

This paper summarizes recent results on applying the method of partial least
squares (PLS) in a reproducing kernel Hilbert space (RKHS). A previously
proposed kernel PLS regression model was proven to be competitive with other
regularized regression methods in RKHS. The family of nonlinear kernel-based
PLS models is extended by considering the kernel PLS method for discrimination.
Theoretical and experimental results on a two-class discrimination problem
indicate usefulness of the method.


1 Introduction 

The partial least squares (PLS) method [18, 19] has been a popular modeling,
regression and discrimination technique in its domain of origin, chemometrics.
PLS creates orthogonal components (scores, latent variables) by using the
existing correlations between different sets of variables (blocks of data) while
also keeping most of the variance of both sets. PLS has proved to be useful in
situations where the number of observed variables is significantly greater than
the number of observations and high multicollinearity among the variables
exists. This situation is also quite common in the case of kernel-based learning
where the original data are mapped to a high-dimensional feature space
corresponding to a reproducing kernel Hilbert space (RKHS). Motivated by the
recent results in kernel-based learning and support vector machines [15, 3, 13],
the nonlinear kernel-based PLS methodology was proposed in [11]. In this paper
we summarize these results and show how the kernel PLS approach can be used for
modeling relations between sets of observed variables, regression and
discrimination in a feature space defined by the selected nonlinear mapping
(the kernel function). We further propose a new form of discrimination based on
a combination of the kernel PLS method for discrimination with a
state-of-the-art support vector machine classifier (SVC) [15, 3, 13]. The
advantage of using kernel PLS for dimensionality reduction in comparison to
kernel principal components analysis (PCA) [14, 13] is discussed in the case of
discrimination problems.


2 RKHS - basic definitions

A RKHS is uniquely defined by a positive definite kernel function
$K(\mathbf{x}, \mathbf{y})$; i.e. a symmetric function of two variables
satisfying the Mercer theorem conditions [?, 3]. Consider $K(., .)$ to be
defined on a compact domain $\mathcal{X} \times \mathcal{X}$;
$\mathcal{X} \subset \mathbb{R}^N$. The fact that for any such positive definite
kernel there exists a unique RKHS $\mathcal{H}$ is well established by the
Moore-Aronszajn theorem [1]. The form $K(\mathbf{x}, \mathbf{y})$ has the
following reproducing property

$$f(\mathbf{y}) = \langle f(\mathbf{x}), K(\mathbf{x}, \mathbf{y}) \rangle_{\mathcal{H}} \quad \forall f \in \mathcal{H},$$

where $\langle ., . \rangle_{\mathcal{H}}$ is the scalar product in
$\mathcal{H}$. The function $K$ is called a reproducing kernel for
$\mathcal{H}$.

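It follows from Mercer's theorem that each positive definite kernel
$K(\mathbf{x}, \mathbf{y})$ defined on a compact domain
$\mathcal{X} \times \mathcal{X}$ can be written in the form

$$K(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{S} \lambda_i \phi_i(\mathbf{x}) \phi_i(\mathbf{y}), \quad S \le \infty, \qquad (1)$$

where $\{\phi_i(.)\}_{i=1}^{S}$ are the eigenfunctions of the integral operator
$T_K : L_2(\mathcal{X}) \to L_2(\mathcal{X})$

$$(T_K f)(\mathbf{x}) = \int_{\mathcal{X}} K(\mathbf{x}, \mathbf{y}) f(\mathbf{y}) \, d\mathbf{y} \quad \forall f \in L_2(\mathcal{X})$$

and $\{\lambda_i > 0\}_{i=1}^{S}$ are the corresponding positive eigenvalues. The
sequence $\{\phi_i(.)\}_{i=1}^{S}$ creates an orthonormal basis of $\mathcal{H}$
and we can express any function $f \in \mathcal{H}$ as
$f(\mathbf{x}) = \sum_{i=1}^{S} a_i \phi_i(\mathbf{x})$ for some
$a_i \in \mathbb{R}$. This allows us to define a scalar product in
$\mathcal{H}$, for $h(\mathbf{x}) = \sum_{i=1}^{S} b_i \phi_i(\mathbf{x})$, as

$$\langle f(\mathbf{x}), h(\mathbf{x}) \rangle_{\mathcal{H}} = \sum_{i=1}^{S} \frac{a_i b_i}{\lambda_i}$$

and the norm

$$\|f\|_{\mathcal{H}}^2 = \sum_{i=1}^{S} \frac{a_i^2}{\lambda_i}.$$

Rewriting (1) in the form

$$K(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{S} \sqrt{\lambda_i} \phi_i(\mathbf{x}) \sqrt{\lambda_i} \phi_i(\mathbf{y}) = \Phi(\mathbf{x})^T \Phi(\mathbf{y}), \qquad (2)$$
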
it becomes clear that any kernel $K(\mathbf{x}, \mathbf{y})$ also corresponds to
a canonical (Euclidean) dot product in a possibly high-dimensional space
$\mathcal{F}$ where the input data are mapped by

$$\Phi : \mathcal{X} \to \mathcal{F}, \qquad \mathbf{x} \mapsto (\sqrt{\lambda_1} \phi_1(\mathbf{x}), \ldots, \sqrt{\lambda_S} \phi_S(\mathbf{x})).$$

The space $\mathcal{F}$ is usually denoted as a feature space and
$\{\{\sqrt{\lambda_i} \phi_i(\mathbf{x})\}_{i=1}^{S}, \mathbf{x} \in \mathcal{X}\}$
as feature mappings. The number of basis functions $\phi_i(.)$ also defines the
dimensionality of $\mathcal{F}$. It is worth noting that we can also construct a
RKHS and a corresponding feature space by choosing a sequence of linearly
independent functions (not necessarily orthogonal)
$\{\psi_i(\mathbf{x})\}_{i=1}^{S}$ and positive numbers $\alpha_i$ to define a
series (in the case of $S = \infty$ absolutely and uniformly convergent)

$$K(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{S} \alpha_i \psi_i(\mathbf{x}) \psi_i(\mathbf{y}).$$


3 Kernel Partial Least Squares 

Because the PLS technique is not widely known, we first provide a description of
linear PLS, which will simplify our next description of its nonlinear
kernel-based variant [11].

Consider a general setting of the linear PLS algorithm to model the relation
between two data sets (blocks of observed variables) $\mathcal{X}$ and
$\mathcal{Y}$. Denote by $\mathbf{x} \in \mathcal{X} \subset \mathbb{R}^N$ an
N-dimensional vector of variables in the first block of data and similarly
$\mathbf{y} \in \mathcal{Y} \subset \mathbb{R}^M$ denotes a vector of variables
from the second set. PLS models relations between these two blocks by means of
latent variables. Observing n data samples from each block of variables, PLS
decomposes the (n x N) matrix of zero-mean variables $\mathbf{X}$ and the
(n x M) matrix of zero-mean variables $\mathbf{Y}$ into the form

$$\mathbf{X} = \mathbf{T}\mathbf{P}^T + \mathbf{F}$$

$$\mathbf{Y} = \mathbf{U}\mathbf{Q}^T + \mathbf{G} \qquad (3)$$

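where $\mathbf{T}$, $\mathbf{U}$ are (n x p) matrices of the extracted p
orthogonal components (scores, latent variables), the (N x p) matrix
$\mathbf{P}$ and the (M x p) matrix $\mathbf{Q}$ represent matrices of loadings
and the (n x N) matrix $\mathbf{F}$ and the (n x M) matrix $\mathbf{G}$ are the
matrices of residuals. The PLS method, which in its classical form is based on
the nonlinear iterative partial least squares (NIPALS) algorithm [18], finds
weight vectors $\mathbf{w}$, $\mathbf{c}$ such that

$$[\mathrm{cov}(\mathbf{t}, \mathbf{u})]^2 = [\mathrm{cov}(\mathbf{X}\mathbf{w}, \mathbf{Y}\mathbf{c})]^2 = \max_{\|\mathbf{r}\| = \|\mathbf{s}\| = 1} [\mathrm{cov}(\mathbf{X}\mathbf{r}, \mathbf{Y}\mathbf{s})]^2$$

where $\mathrm{cov}(\mathbf{t}, \mathbf{u}) = \mathbf{t}^T\mathbf{u}/n$ denotes
the sample covariance between the two score vectors (components). The NIPALS
algorithm starts with random initialization of the Y-score vector $\mathbf{u}$
and repeats a sequence of the following steps until convergence:

1) $\mathbf{w} = \mathbf{X}^T\mathbf{u} / (\mathbf{u}^T\mathbf{u})$
2) $\|\mathbf{w}\| \to 1$
3) $\mathbf{t} = \mathbf{X}\mathbf{w}$
4) $\mathbf{c} = \mathbf{Y}^T\mathbf{t} / (\mathbf{t}^T\mathbf{t})$
5) $\mathbf{u} = \mathbf{Y}\mathbf{c} / (\mathbf{c}^T\mathbf{c})$
6) repeat steps 1-5 until convergence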
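The six NIPALS steps can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names (`matvec`, `transpose`, `dot`, `nipals_component`), the convergence test on the change of the score vector t, the tolerance and the toy data are all hypothetical choices, not part of the original algorithm description.

```python
import random

def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def nipals_component(X, Y, tol=1e-12, max_iter=1000, seed=0):
    """Extract one set of PLS vectors (w, c, t, u) from zero-mean X and Y."""
    rng = random.Random(seed)
    n = len(X)
    Xt, Yt = transpose(X), transpose(Y)
    u = [rng.gauss(0.0, 1.0) for _ in range(n)]      # random initialisation of the Y-score
    t_old = [0.0] * n
    for _ in range(max_iter):
        w = [v / dot(u, u) for v in matvec(Xt, u)]   # 1) w = X^T u / (u^T u)
        norm_w = dot(w, w) ** 0.5
        w = [v / norm_w for v in w]                  # 2) ||w|| -> 1
        t = matvec(X, w)                             # 3) t = X w
        c = [v / dot(t, t) for v in matvec(Yt, t)]   # 4) c = Y^T t / (t^T t)
        u = [v / dot(c, c) for v in matvec(Y, c)]    # 5) u = Y c / (c^T c)
        if sum((a - b) ** 2 for a, b in zip(t, t_old)) < tol:
            break                                    # 6) stop once t no longer changes (assumed criterion)
        t_old = t
    return w, c, t, u

# Toy zero-mean data blocks (hypothetical values): n = 4, N = 2, M = 2.
X = [[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]]
Y = [[-2.0, 1.0], [0.0, -1.0], [0.0, -1.0], [2.0, 1.0]]
w, c, t, u = nipals_component(X, Y)
```

The sign of the returned weight vector depends on the random start, so any comparison against a reference direction should be made up to sign.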

However, it can be shown [5] that we can directly estimate the weight vector
$\mathbf{w}$ as the first eigenvector of the following eigenvalue problem

$$\mathbf{X}^T\mathbf{Y}\mathbf{Y}^T\mathbf{X}\mathbf{w} = \lambda\mathbf{w}. \qquad (4)$$

The X-scores $\mathbf{t}$ are then given as

$$\mathbf{t} = \mathbf{X}\mathbf{w}. \qquad (5)$$

We can similarly derive eigenvalue problems for the extraction of the
$\mathbf{t}$, $\mathbf{u}$ and $\mathbf{c}$ estimates [5] and solve one of them
for the computation of the other vectors. The nonlinear kernel PLS method is
based on mapping the original input data into a high-dimensional feature space
$\mathcal{F}$. In this case we usually cannot compute the vectors $\mathbf{w}$
and $\mathbf{c}$. Thus, we need to reformulate the NIPALS algorithm into its
kernel variant [6, 11]. Alternatively, we can directly estimate the score vector
$\mathbf{t}$ as the first eigenvector of the following eigenvalue problem [5, 9]
(this can be easily shown by multiplying both sides of (4) by the $\mathbf{X}$
matrix and using (5))

$$\mathbf{X}\mathbf{X}^T\mathbf{Y}\mathbf{Y}^T\mathbf{t} = \lambda\mathbf{t}. \qquad (6)$$

The Y-scores $\mathbf{u}$ are then estimated as

$$\mathbf{u} = \mathbf{Y}\mathbf{Y}^T\mathbf{t}. \qquad (7)$$

Now, consider a nonlinear transformation of $\mathbf{x}$ into a feature space
$\mathcal{F}$. Using the straightforward connection between a RKHS and
$\mathcal{F}$ we have extended the linear PLS model into its nonlinear kernel
form [11]. Effectively this extension represents the construction of a linear
PLS model in $\mathcal{F}$. Denote by $\boldsymbol{\Phi}$ the (n x S) matrix of
mapped $\mathcal{X}$-space data $\Phi(\mathbf{x})$ into an S-dimensional feature
space $\mathcal{F}$. Instead of an explicit mapping of the data we can use
property (2) and write

$$\boldsymbol{\Phi}\boldsymbol{\Phi}^T = \mathbf{K}$$

where $\mathbf{K}$ represents the (n x n) kernel Gram matrix of the cross dot
products between all input data points $\{\Phi(\mathbf{x}_i)\}_{i=1}^{n}$; i.e.
$\mathbf{K}_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$ where $K(., .)$ is a selected
kernel function. We can similarly consider a mapping of the second set of
variables $\mathbf{y}$ into a feature space $\mathcal{F}_1$ and denote by
$\boldsymbol{\Psi}$ the (n x $S_1$) matrix of mapped $\mathcal{Y}$-space data
$\Psi(\mathbf{y})$ into an $S_1$-dimensional feature space $\mathcal{F}_1$. We
can write

$$\boldsymbol{\Psi}\boldsymbol{\Psi}^T = \mathbf{K}_1$$

where $\mathbf{K}_1$, similar to $\mathbf{K}$, represents the (n x n) kernel
Gram matrix given by the kernel function $K_1(., .)$. Using this notation we can
reformulate the estimates of $\mathbf{t}$ (6) and $\mathbf{u}$ (7) into their
nonlinear kernel variant

$$\mathbf{K}\mathbf{K}_1\mathbf{t} = \lambda\mathbf{t} \qquad (8)$$

$$\mathbf{u} = \mathbf{K}_1\mathbf{t}$$

At the beginning of this section we assumed a zero-mean regression model. To
center the mapped data in a feature space $\mathcal{F}$ we can simply apply the
following procedure [14, 11]

$$\mathbf{K} \leftarrow (\mathbf{I} - \tfrac{1}{n}\mathbf{1}_n\mathbf{1}_n^T)\,\mathbf{K}\,(\mathbf{I} - \tfrac{1}{n}\mathbf{1}_n\mathbf{1}_n^T)$$

where $\mathbf{I}$ is an n-dimensional identity matrix and $\mathbf{1}_n$
represents the (n x 1) vector with elements equal to one. The same is true for
$\mathbf{K}_1$.
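As an illustration of the centering step, the following minimal sketch builds a Gram matrix and centers it. It assumes the Gaussian kernel $\exp(-\|\mathbf{x} - \mathbf{y}\|^2 / h)$ that is used later in the experiments; the function names and the three sample points are hypothetical. The centering is computed from row, column and overall means, which is algebraically the same as multiplying by the projector on both sides.

```python
import math

def gaussian_kernel(x, y, h=1.0):
    """K(x, y) = exp(-||x - y||^2 / h) with width parameter h."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / h)

def gram_matrix(points, kernel):
    """(n x n) matrix of kernel values between all pairs of points."""
    return [[kernel(p, q) for q in points] for p in points]

def center_gram(K):
    """Return (I - 1 1^T / n) K (I - 1 1^T / n) without forming the projector."""
    n = len(K)
    row_mean = [sum(row) / n for row in K]
    col_mean = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_mean) / n
    return [[K[i][j] - row_mean[i] - col_mean[j] + total_mean
             for j in range(n)] for i in range(n)]

# Hypothetical two-dimensional input points.
points = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
K = gram_matrix(points, gaussian_kernel)
Kc = center_gram(K)   # every row and column of Kc sums to zero (up to rounding)
```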

After the extraction of new score vectors $\mathbf{t}$, $\mathbf{u}$ the
matrices $\mathbf{K}$ and $\mathbf{K}_1$ are deflated by subtracting their
rank-one approximations based on $\mathbf{t}$ and $\mathbf{u}$. The different
forms of deflation correspond to different forms of PLS (see [17] for a
review). The PLS Mode A is based on rank-one deflation of individual block
matrices using corresponding score and loading vectors. This approach was
originally designed by H. Wold [18] to model the relation between the different
blocks of data. Because (4) corresponds to the singular value decomposition of
the transposed cross-product matrix $\mathbf{X}^T\mathbf{Y}$, computation of all
eigenvectors from (4) at once involves implicit rank-one deflation of the
overall transposed cross-product matrix. This form of PLS was used in [12] and
in accordance with [17] we denote it as PLS-SB. The kernel analog of PLS-SB
results from the computation of all eigenvectors of (8) at once. PLS1 (one of
the blocks has a single variable) and PLS2 (both blocks are multidimensional) as
regression methods use a different form of deflation which we describe in the
next section.

3.1 Kernel PLS Regression 

In kernel PLS regression we estimate a linear PLS regression model in a feature
space $\mathcal{F}$. The data set $\mathcal{Y}$ represents a set of dependent
output variables and in this scenario we do not have reason to consider a
nonlinear mapping of the $\mathbf{y}$ variables into a feature space
$\mathcal{F}_1$. This simply means that we consider
$\mathbf{K}_1 = \mathbf{Y}\mathbf{Y}^T$ and $\mathcal{F}_1$ to be the original
Euclidean $\mathbb{R}^M$ space. In agreement with the standard linear PLS model
we further assume the score variables $\{\mathbf{t}_i\}_{i=1}^{p}$ are good
predictors of $\mathbf{Y}$. We also assume a linear inner relation between the
scores of $\mathbf{t}$ and $\mathbf{u}$; i.e.

$$\mathbf{U} = \mathbf{T}\mathbf{B} + \mathbf{H}$$

where $\mathbf{B}$ is the (p x p) diagonal matrix and $\mathbf{H}$ denotes the
matrix of residuals.

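In this case, we can rewrite the decomposition of the $\mathbf{Y}$ matrix (3) as

$$\mathbf{Y} = \mathbf{U}\mathbf{Q}^T + \mathbf{G} = (\mathbf{T}\mathbf{B} + \mathbf{H})\mathbf{Q}^T + \mathbf{G} = \mathbf{T}\mathbf{B}\mathbf{Q}^T + (\mathbf{H}\mathbf{Q}^T + \mathbf{G})$$

which defines the considered linear PLS regression model

$$\mathbf{Y} = \mathbf{T}\mathbf{C}^T + \mathbf{F}^*$$

where $\mathbf{C}^T = \mathbf{B}\mathbf{Q}^T$ now denotes the (p x M) matrix of
regression coefficients and
$\mathbf{F}^* = \mathbf{H}\mathbf{Q}^T + \mathbf{G}$ is the Y-residual matrix.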

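Taking into account normalized scores $\mathbf{t}$ we define the estimate of the
PLS regression model in $\mathcal{F}$ as [11]

$$\hat{\mathbf{Y}} = \mathbf{K}\mathbf{U}(\mathbf{T}^T\mathbf{K}\mathbf{U})^{-1}\mathbf{T}^T\mathbf{Y} = \mathbf{T}\mathbf{T}^T\mathbf{Y}. \qquad (9)$$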

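It is worth noting that different scalings of the individual Y-score vectors do
not influence this estimate. The deflation in the case of PLS1 and PLS2 is based
on rank-one reduction of the $\boldsymbol{\Phi}$ and $\mathbf{Y}$ matrices using
a new extracted score vector $\mathbf{t}$ at each step. It can be written in the
kernel form as follows [11]

$$\mathbf{K} \leftarrow (\mathbf{I} - \mathbf{t}\mathbf{t}^T)\mathbf{K}(\mathbf{I} - \mathbf{t}\mathbf{t}^T); \qquad \mathbf{K}_1 \leftarrow (\mathbf{I} - \mathbf{t}\mathbf{t}^T)\mathbf{K}_1(\mathbf{I} - \mathbf{t}\mathbf{t}^T)$$

This deflation is based on the fact that we decompose the $\boldsymbol{\Phi}$
matrix as
$\boldsymbol{\Phi} \leftarrow \boldsymbol{\Phi} - \mathbf{t}\mathbf{p}^T = \boldsymbol{\Phi} - \mathbf{t}\mathbf{t}^T\boldsymbol{\Phi}$,
where $\mathbf{p}$ is the vector of loadings corresponding to the extracted
component $\mathbf{t}$. Similarly for the $\mathbf{Y}$ matrix we can write
$\mathbf{Y} \leftarrow \mathbf{Y} - \mathbf{t}\mathbf{c}^T = \mathbf{Y} - \mathbf{t}\mathbf{t}^T\mathbf{Y}$.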
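To make the interplay of (8), (9) and the deflation concrete, here is a minimal sketch of component extraction on a toy one-dimensional regression problem. It is not necessarily the kernel NIPALS procedure of [11]: it simply runs a power iteration on $\mathbf{K}\mathbf{K}_1$ with $\mathbf{K}_1 = \mathbf{Y}\mathbf{Y}^T$, normalizes the score, and deflates. The function names, the fixed iteration count, the deterministic start vector and the sine data are all assumptions made for illustration.

```python
import math

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def kernel_pls_scores(K, Y, p, iters=500):
    """Return the (n x p) matrix T of unit-norm scores for centered K, zero-mean Y."""
    n, M = len(K), len(Y[0])
    K = [row[:] for row in K]
    Y = [row[:] for row in Y]
    T = []
    for _ in range(p):
        t = [1.0 / (i + 1) for i in range(n)]        # arbitrary start vector (assumption)
        for _ in range(iters):
            # One power-iteration step on K K_1 with K_1 = Y Y^T, cf. (8).
            c = [sum(Y[i][m] * t[i] for i in range(n)) for m in range(M)]   # Y^T t
            u = [sum(Y[i][m] * c[m] for m in range(M)) for i in range(n)]   # u = Y Y^T t
            t = [sum(K[i][j] * u[j] for j in range(n)) for i in range(n)]   # K u
            norm = sum(v * v for v in t) ** 0.5
            t = [v / norm for v in t]                                        # ||t|| -> 1
        T.append(t)
        # Deflation: K <- (I - t t^T) K (I - t t^T), Y <- Y - t t^T Y.
        D = [[(1.0 if i == j else 0.0) - t[i] * t[j] for j in range(n)] for i in range(n)]
        K = matmul(matmul(D, K), D)
        Y = matmul(D, Y)
    return [list(col) for col in zip(*T)]            # columns are the score vectors

def fitted_values(T, Y):
    """Training-set estimate T T^T Y from (9), valid for orthonormal scores."""
    Tt = [list(col) for col in zip(*T)]
    return matmul(T, matmul(Tt, Y))

# Toy problem (hypothetical): noiseless sine sampled at six points.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
n = len(xs)
K = [[math.exp(-(a - b) ** 2) for b in xs] for a in xs]      # Gaussian kernel, h = 1
row = [sum(r) / n for r in K]
tot = sum(row) / n
Kc = [[K[i][j] - row[i] - row[j] + tot for j in range(n)] for i in range(n)]  # centering (K symmetric)
ybar = sum(math.sin(x) for x in xs) / n
Yc = [[math.sin(x) - ybar] for x in xs]                      # zero-mean output block
T = kernel_pls_scores(Kc, Yc, p=2)
Y_hat = fitted_values(T, Yc)
```

Because each new score is computed from the deflated Gram matrix, the extracted scores are mutually orthogonal, which is what allows the simplified form $\mathbf{T}\mathbf{T}^T\mathbf{Y}$ in (9).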

Denote
$\mathbf{d}^m = \mathbf{U}(\mathbf{T}^T\mathbf{K}\mathbf{U})^{-1}\mathbf{T}^T\mathbf{Y}^m$,
$m = 1, \ldots, M$, where the (n x 1) vector $\mathbf{Y}^m$ represents the m-th
output variable. Then the solution of the kernel PLS regression (9) for the m-th
output variable can be written as

$$g^m(\mathbf{x}, \mathbf{d}^m) = \sum_{i=1}^{n} d_i^m K(\mathbf{x}, \mathbf{x}_i)$$

which agrees with the solution of the regularized form of regression in RKHS
given by the Representer theorem [16, 11]. Using equation (9) we may also
interpret the kernel PLS model as a linear regression model of the form

$$g^m(\mathbf{x}, \mathbf{c}^m) = c_1^m t_1(\mathbf{x}) + c_2^m t_2(\mathbf{x}) + \ldots + c_p^m t_p(\mathbf{x}) = \sum_{i=1}^{p} c_i^m t_i(\mathbf{x})$$

where $\{t_i(\mathbf{x})\}_{i=1}^{p}$ are the projections of the data point
$\mathbf{x}$ onto the extracted p components and
$\mathbf{c}^m = \mathbf{T}^T\mathbf{Y}^m$ is the vector of weights for the m-th
regression model.

Although the scores $\{\mathbf{t}_i\}_{i=1}^{p}$ are defined to be vectors in an
S-dimensional feature space $\mathcal{F}$ we may equally represent the scores to
be functions of the original input data $\mathbf{x}$. Thus, the proposed kernel
PLS regression technique can be seen as a method of sequential construction of a
basis of orthogonal functions $\{t_i(\mathbf{x})\}_{i=1}^{p}$ which are
evaluated at the discretized locations $\{\mathbf{x}_i\}_{i=1}^{n}$. It is
important to note that the scores are extracted such that they increasingly
describe overall variance in the input data space and more interestingly also
describe the overall variance of the observed output data samples.



3.2 Kernel PLS Discrimination 


Consider the ordinary least squares regression with outputs $\mathbf{Y}$ being
an indicator vector coding two classes with the values +1 and -1, respectively.
The regression coefficient vector from the least squares solution is then
proportional to the linear discriminant analysis (LDA) direction [4]. Moreover,
if the number of samples in both classes is equal, the intercepts are the same,
resulting in the same decision rules. This close connection between LDA and
least squares regression motivates the use of PLS for discrimination. Moreover,
a very close connection between Fisher's LDA (FDA) and PLS-SB methods for
multi-class discrimination has been shown in [2]. Using the fact that PLS can be
seen as a form of penalized canonical correlations analysis (CCA)^1

$$[\mathrm{cov}(\mathbf{t}, \mathbf{u})]^2 = [\mathrm{cov}(\mathbf{X}\mathbf{w}, \mathbf{Y}\mathbf{c})]^2 = \mathrm{var}(\mathbf{X}\mathbf{w}) \, [\mathrm{corr}(\mathbf{X}\mathbf{w}, \mathbf{Y}\mathbf{c})]^2 \, \mathrm{var}(\mathbf{Y}\mathbf{c})$$

it was suggested [2] to remove the not meaningful $\mathcal{Y}$-space penalty
$\mathrm{var}(\mathbf{Y}\mathbf{c})$ in the PLS discrimination scenario where
the Y-block of data is coded in the following way

$$\mathbf{Y} = \begin{pmatrix} \mathbf{1}_{n_1} & \mathbf{0}_{n_1} & \ldots & \mathbf{0}_{n_1} \\ \mathbf{0}_{n_2} & \mathbf{1}_{n_2} & \ldots \\ \mathbf{0}_{n_g} & \mathbf{0}_{n_g} & \ldots \end{pmatrix}$$

Here, $\{n_i\}_{i=1}^{g}$ denotes the number of samples in each class. This
modified PLS method is then based on eigensolutions of the between-classes
scatter matrix, which connects this approach to CCA or equivalently to FDA
[4, 2]. More interestingly, in the case of two classes the direction of only one
PLS component will be identical with the first PLS component found by the PLS1
method with the Y-block represented by the vector with dummy variables coding
two classes. However, in the case of PLS1 we can extract additional components
each possessing the same similarity with directions computed with CCA on
deflated X-block matrices. This provides a more principled dimensionality
reduction in comparison to standard PCA based on the criterion of maximum data
variation in the $\mathcal{X}$-space alone.
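The factorization of the squared covariance into a variance term for each block and a squared correlation, on which the penalized-CCA view of PLS rests, can be checked numerically. The sketch below uses the sample covariance with the 1/n normalization from Section 3; the two score vectors are made-up numbers standing in for $\mathbf{X}\mathbf{w}$ and $\mathbf{Y}\mathbf{c}$, and the function names are illustrative.

```python
def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Sample covariance with 1/n normalization."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

def var(a):
    return cov(a, a)

def corr(a, b):
    return cov(a, b) / (var(a) * var(b)) ** 0.5

t = [0.5, -1.0, 2.0, 0.0, -1.5]   # hypothetical X-scores Xw
u = [1.0, -1.0, 1.0, -1.0, 0.0]   # hypothetical Y-scores Yc (two-class style coding)

lhs = cov(t, u) ** 2
rhs = var(t) * corr(t, u) ** 2 * var(u)
# lhs and rhs agree up to rounding; dropping var(u) from rhs leaves the
# criterion var(Xw) * corr^2 that remains after removing the Y-space penalty.
```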

On several classification problems the use of kernel PCA for dimensionality
reduction and/or de-noising followed by a linear SVC computed on this reduced
$\mathcal{F}$-space data representation has shown good results in comparison to
a nonlinear SVC using the original data representation [13, 14]. However, the
previous theoretical results suggest replacing the kernel PCA data preprocessing
step with the more principled kernel PLS. The advantage of using a linear SVC as
the follow-up step is motivated by the construction of an optimal separating
hyperplane in the sense of maximizing the distance to the closest point from
either class

^1 In agreement with previous notation, var(.) and corr(., .) denote the sample
variance and correlation, respectively.



Method            KPLS-SVC      SVC           KFDA          RBF
avg. error [%]    10.6 ± 0.4    11.5 ± 0.7    10.8 ± 0.5    10.8 ± 0.6

Table 1: Comparison of results between kernel PLS with ν-SVC (KPLS-SVC),
C-Support Vector Classifier (SVC), kernel Fisher's LDA (KFDA) and Radial Basis
Functions classifier (RBF). The results represent the average and standard
deviation of the misclassification error using 100 different test sets.


[15, 3, 13]. In comparison to nonlinear kernel FDA [8, 13] this may become more
suitable in the situation of non-Gaussian class distribution in a feature space
$\mathcal{F}$. Moreover, when the data are not separable the SVC approach
provides a way to control the extent of this overlap.


4 Experiments 

On an example of a two-class discrimination problem (Fig. 1 (left)) we
demonstrate good results using the proposed combined method of nonlinear
kernel-based PLS components extraction and the subsequent linear ν-SVC [13] (we
denote this method KPLS-SVC). We have used the banana data set obtained via
http://www.first.gmd.de/~raetsch. This data repository provides the complete 100
partitions of training and testing data used in previous experiments
[10, 8, 13]. The repository also provides the value of the width parameter h of
the Gaussian kernel
$K(\mathbf{x}, \mathbf{y}) = \exp(-\|\mathbf{x} - \mathbf{y}\|^2 / h)$ found by
5-fold cross-validation (CV) on the first five training data partitions and used
by the C-SVC classifier [13] and kernel FDA methods [8], respectively (on this
data set the 5-fold CV method results in the same value of the width for both of
the methods, h = 1). Thus, in all experiments we have used the Gaussian kernel
with the same width and we have applied the same CV strategy for the selection
of the number of used kernel PLS components and the value of the ν parameter for
ν-SVC. The final number of components and ν value were set to be equal to the
median of the five different estimates.

In Table 1 we compare the achieved results with the results using different
methods but with identical data partitioning [10, 8, 13]. The proposed KPLS-SVC
method achieves the lowest average misclassification error among the compared
methods. We have further investigated the influence of the number of selected
components on the overall accuracy of KPLS-SVC. For a fixed number of components
the "optimal" value of the ν parameter was set using the same CV strategy as
described above. Results in Fig. 1 (right) show that when more than five PLS
components are selected the method provides very consistent, low
misclassification rates. Finally, in Fig. 2 we plot the projection of the data
from both classes onto the direction found by kernel FDA, using the first
component found by kernel PLS and the first component found by kernel



Figure 1: Left: an example of training patterns (the first training data
partition was used). Right: dependence of the averaged misclassification error
on the number of PLS components used. The standard deviation is represented by
the dotted lines. For a fixed number of components cross-validation (CV) was
used to set the ν parameter for ν-SVC. The cross point indicates the minimum
misclassification error achieved. The star indicates the misclassification error
when both the number of components and the ν value were set by CV (see Table 1).


PCA, respectively. While we see similarity and nice separation of two classes in 
the case of kernel FDA and kernel PLS, the kernel PCA method fails to separate 
the data using the first principal component. 


5 Conclusions 

A summary of the kernel PLS methodology in RKHS was provided. We have 
shown that the method may be useful for the modeling of existing relations 
between blocks of variables. With specific arrangement of one of the blocks of 
variables we may use the technique for nonlinear regression or discrimination 
problems. We have shown that the proposed technique of combining dimension- 
ality reduction by means of kernel PLS and discrimination of the classes using 
SVC methodology may result in performance comparable with the previously 
used classification techniques. Moreover, the projection of the high-dimensional 
feature space data onto a small number of necessary PLS components resulting 
in optimal or near optimal discrimination gives rise to the possibility of vi- 
sual inspection of data separability providing more useful insight into the data 
structure. Following the theoretical and practical results reported in [2] we also 
argue that kernel PLS would be preferred to kernel PCA when a feature space 
dimensionality reduction with respect to data discrimination is employed. The 





Figure 2: The values of, top: data projected onto the direction found by kernel
Fisher discriminant; middle: the first kernel PLS component; bottom: the first
kernel PCA principal component. The data depicted in Fig. 1 (left) were used.


proposed combination of kernel PLS with SVC can be useful in real world situ- 
ations where we can expect overlaps among different classes with non-Gaussian 
distribution. 